The goal of the midterm is to apply some of the methods for supervised and unsupervised analysis to a new dataset. We will work with data characterizing the relationship between wine quality and its analytical characteristics, available at the UCI ML repository as well as on this course website on canvas. The overall goal will be to use data modeling approaches to understand which wine properties most influence wine quality as determined by expert evaluation. The output variable in this case assigns wine to discrete categories between 0 (the worst) and 10 (the best), so this problem can be formulated as classification or regression; here we will stick to the latter and treat/model the outcome as a continuous variable. For more details please see the dataset description available at UCI ML or the corresponding file on this course website on canvas. Please note that there is another, much smaller, dataset on UCI ML also characterizing wine in terms of its analytical properties. Make sure to use the correct URL as shown above or, to eliminate any possibility of ambiguity, the data available on the course website in canvas; the correct dataset contains several thousand observations. For simplicity and clarity, and to decrease your dependency on network reliability and UCI ML availability, you are advised to download the data made available on this course website to a local folder and work with this local copy.
There are two compilations of data available under the URL shown above as well as on the course website in canvas, one for red and one for white wine. Please develop models of wine quality for each of them, investigate the attributes deemed important for wine quality in both, and determine whether the quality of red and white wine is influenced predominantly by the same or by different analytical properties (i.e. predictors in these datasets). Lastly, as an exercise in unsupervised learning, you will be asked to combine the analytical data for red and white wine and describe the structure of the resulting data: whether there are any well-defined clusters, what subsets of observations they appear to represent, which attributes seem to affect this structure the most, etc.
Finally, as you will notice, the instructions here are terser than in the previous homework assignments. We expect you to use what you have learned in the class to complete the analysis and draw appropriate conclusions based on the data. All approaches that you are expected to apply here have been exercised in the preceding weekly assignments; please feel free to consult your submissions and/or the official solutions to see how they were applied to different datasets. As always, if something appears to be unclear, please ask questions. We may switch to private mode any questions that, in our opinion, reveal too many details.
Download and read in the data, produce numerical and graphical summaries of the dataset attributes, decide whether they can be used for modeling in untransformed form or any transformations are justified, comment on correlation structure and whether some of the predictors suggest relationship with the outcome.
After briefly going through [http://onlinelibrary.wiley.com/doi/10.1002/9781118730720.fmatter/pdf], [http://winefolly.com/review/understanding-acidity-in-wine/] and other literature online, and with some basic background knowledge, we can hypothesize that the following attributes mainly affect the quality of wine:
1. acidity (fixed acidity, volatile acidity, citric acid, etc.)
2. sugar level (residual sugar)
3. pH, which is itself a measure of acidity
4. alcohol level
We will first look at the data and remove any null values, then analyze each attribute on its own and against quality, and finally perform a pairwise analysis.
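# Load packages used below: MASS provides truehist(), ggplot2 provides ggplot()
library(MASS)
library(ggplot2)
# Read sample data
# wr - red wine
# ww - white wine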
setwd("/Users/RaviRani/Documents/Harvard-Extension/CSCI E-63/midterm")
wr<-read.table("winequality-red.csv",sep=";",header=TRUE)
ww<-read.table("winequality-white.csv",sep=";",header=TRUE)
#head of red wine & white wine
head(wr)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
head(ww)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
# column names of white & red wine
colnames(wr)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
colnames(ww)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
# Dimension of red & white wines before removing null values
#dim(wr)
#dim(ww)
#convert to data frame
# Created variables for log and sqrt transformation
dfwr<-as.data.frame.matrix(wr)
dfww<-as.data.frame.matrix(ww)
logdfwr<-as.data.frame.matrix(wr)
logdfww<-as.data.frame.matrix(ww)
sqrtdfwr<-as.data.frame.matrix(wr)
sqrtdfww<-as.data.frame.matrix(ww)
# check for null values for both wines
sum(is.na(dfwr))
## [1] 0
dfwr<-na.omit(dfwr)
sum(is.na(dfww))
## [1] 0
dfww<-na.omit(dfww)
dim(dfwr)
## [1] 1599 12
dim(dfww)
## [1] 4898 12
# check for null is done
# untransformed
summary(dfwr)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
# drawing distribution of all attributes for red wine
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
barplot((table(dfwr$quality)), col=c("DeepSkyBlue4", "DeepSkyBlue", "DeepSkyBlue1", "DeepSkyBlue2", "DeepSkyBlue3", "DeepSkyBlue4"))
mtext("Quality", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$fixed.acidity, h = 0.5, col="DeepSkyBlue")
mtext("Fixed Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$volatile.acidity, h = 0.05, col="DeepSkyBlue")
mtext("Volatile Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$citric.acid, h = 0.1, col="DeepSkyBlue")
mtext("Citric Acid", side=1, outer=F, line=2, cex=0.8)
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
truehist(dfwr$residual.sugar, h = 0.5, col="DeepSkyBlue")
mtext("Residual Sugar", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$chlorides, h = 0.01, col="DeepSkyBlue")
mtext("chlorides", side=1, outer=F, line=2, cex=0.8)
# bin widths (h) are scaled to the range of each attribute
truehist(dfwr$free.sulfur.dioxide, h = 2, col="DeepSkyBlue")
mtext("free.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$total.sulfur.dioxide, h = 10, col="DeepSkyBlue")
mtext("total.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
# density only spans about 0.990 to 1.004, so it needs a much narrower bin width
truehist(dfwr$density, h = 0.0005, col="DeepSkyBlue")
mtext("Density", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$pH, h = 0.1, col="DeepSkyBlue")
mtext("pH", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$sulphates, h = 0.05, col="DeepSkyBlue")
mtext("Sulphates", side=1, outer=F, line=2, cex=0.8)
truehist(dfwr$alcohol, h = 0.1, col="DeepSkyBlue")
mtext("alcohol", side=1, outer=F, line=2, cex=0.8)
summary(dfww)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
# drawing distribution of all attributes for white wine
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
barplot((table(dfww$quality)), col=c("DeepSkyBlue4", "DeepSkyBlue", "DeepSkyBlue1", "DeepSkyBlue2", "DeepSkyBlue3", "DeepSkyBlue4"))
mtext("Quality", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$fixed.acidity, h = 0.5, col="DeepSkyBlue")
mtext("Fixed Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$volatile.acidity, h = 0.05, col="DeepSkyBlue")
mtext("Volatile Acidity", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$citric.acid, h = 0.1, col="DeepSkyBlue")
mtext("Citric Acid", side=1, outer=F, line=2, cex=0.8)
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
truehist(dfww$residual.sugar, h = 0.5, col="DeepSkyBlue")
mtext("Residual Sugar", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$chlorides, h = 0.01, col="DeepSkyBlue")
mtext("chlorides", side=1, outer=F, line=2, cex=0.8)
# bin widths (h) are scaled to the range of each attribute
truehist(dfww$free.sulfur.dioxide, h = 5, col="DeepSkyBlue")
mtext("free.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$total.sulfur.dioxide, h = 10, col="DeepSkyBlue")
mtext("total.sulfur.dioxide", side=1, outer=F, line=2, cex=0.8)
par(mfrow=c(2,2), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
# density only spans about 0.987 to 1.039, so it needs a much narrower bin width
truehist(dfww$density, h = 0.001, col="DeepSkyBlue")
mtext("Density", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$pH, h = 0.1, col="DeepSkyBlue")
mtext("pH", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$sulphates, h = 0.05, col="DeepSkyBlue")
mtext("Sulphates", side=1, outer=F, line=2, cex=0.8)
truehist(dfww$alcohol, h = 0.1, col="DeepSkyBlue")
mtext("alcohol", side=1, outer=F, line=2, cex=0.8)
1. Red wine analysis of attributes
Judging by the summaries and histograms, quality is roughly bell-shaped, with most values equal to 5 or 6. Fixed and volatile acidity are also roughly bell-shaped, while citric acid is closer to uniform with a peak at the lower end.
Residual sugar is close to normal over most of its range but right-skewed (median 2.2, maximum 15.5). Sulphates and the two SO2 measures show the same pattern.
pH and density look approximately normal; alcohol is not normally distributed.
2. White wine analysis of attributes
Quality, fixed acidity, volatile acidity and citric acid behave as in red wine. Residual sugar and chlorides are right-skewed. The SO2 values are roughly normal. Density appears to have many outliers. pH is approximately normally distributed.
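The visual impressions above can be cross-checked numerically. The following is a minimal sketch and was not part of the original analysis: `skewness_by_column` is a hypothetical helper that computes the moment-based sample skewness of each numeric column. Values near 0 suggest a symmetric distribution, and clearly positive values suggest a right tail. The toy data frame only demonstrates the call; in this report the helper would be applied to `dfwr` and `dfww`.

```r
# Hypothetical helper (not from the original analysis): moment-based sample
# skewness, g1 = m3 / m2^(3/2), for every numeric column of a data frame.
skewness_by_column <- function(df) {
  num <- df[sapply(df, is.numeric)]
  sapply(num, function(x) {
    x <- x[!is.na(x)]
    d <- x - mean(x)
    mean(d^3) / mean(d^2)^1.5
  })
}

# Toy data for illustration only: one symmetric and one right-skewed column.
toy <- data.frame(symmetric = c(1, 2, 3, 4, 5),
                  right.skewed = c(1, 1, 1, 2, 10))
round(skewness_by_column(toy), 2)

# With the wine data loaded as in this report:
# sort(skewness_by_column(dfwr), decreasing = TRUE)
```

Attributes with large positive skewness would be the natural candidates for a log or square-root transformation.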
#Boxplots of attributes for red wine
par(mfrow=c(1,6), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(dfwr$fixed.acidity, pch=19)
mtext("Fixed Acidity", cex=0.8, side=1, line=2)
boxplot(dfwr$volatile.acidity, pch=19)
mtext("volatile.acidity", cex=0.8, side=1, line=2)
boxplot(dfwr$citric.acid, pch=19)
mtext("citric.acid", cex=0.8, side=1, line=2)
boxplot(dfwr$residual.sugar, pch=19)
mtext("residual.sugar", cex=0.8, side=1, line=2)
boxplot(dfwr$chlorides, pch=19)
mtext("chlorides", cex=0.8, side=1, line=2)
boxplot(dfwr$free.sulfur.dioxide, pch=19)
mtext("free.sulfur.dioxide", cex=0.8, side=1, line=2)
par(mfrow=c(1,6), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(dfwr$total.sulfur.dioxide, pch=19)
mtext("total.sulfur.dioxide", cex=0.8, side=1, line=2)
boxplot(dfwr$density, pch=19)
mtext("Density", cex=0.8, side=1, line=2)
boxplot(dfwr$pH, pch=19)
mtext("pH", cex=0.8, side=1, line=2)
boxplot(dfwr$sulphates, pch=19)
mtext("Sulphates", cex=0.8, side=1, line=2)
boxplot(dfwr$alcohol, pch=19)
mtext("Alcohol", cex=0.8, side=1, line=2)
boxplot(dfwr$quality, pch=19)
mtext("Quality", cex=0.8, side=1, line=2)
#Boxplots of attributes for white wine
par(mfrow=c(1,6), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(dfww$fixed.acidity, pch=19)
mtext("Fixed Acidity", cex=0.8, side=1, line=2)
boxplot(dfww$volatile.acidity, pch=19)
mtext("volatile.acidity", cex=0.8, side=1, line=2)
boxplot(dfww$citric.acid, pch=19)
mtext("citric.acid", cex=0.8, side=1, line=2)
boxplot(dfww$residual.sugar, pch=19)
mtext("residual.sugar", cex=0.8, side=1, line=2)
boxplot(dfww$chlorides, pch=19)
mtext("chlorides", cex=0.8, side=1, line=2)
boxplot(dfww$free.sulfur.dioxide, pch=19)
mtext("free.sulfur.dioxide", cex=0.8, side=1, line=2)
par(mfrow=c(1,6), oma = c(1,1,0,0) + 0.1, mar = c(3,3,1,1) + 0.1)
boxplot(dfww$total.sulfur.dioxide, pch=19)
mtext("total.sulfur.dioxide", cex=0.8, side=1, line=2)
boxplot(dfww$density, pch=19)
mtext("Density", cex=0.8, side=1, line=2)
boxplot(dfww$pH, pch=19)
mtext("pH", cex=0.8, side=1, line=2)
boxplot(dfww$sulphates, pch=19)
mtext("Sulphates", cex=0.8, side=1, line=2)
boxplot(dfww$alcohol, pch=19)
mtext("Alcohol", cex=0.8, side=1, line=2)
boxplot(dfww$quality, pch=19)
mtext("Quality", cex=0.8, side=1, line=2)
The box plots above show outliers in almost all of the attributes.
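To put a number on how many outliers each attribute has, the whisker rule used by the box plots can be applied directly. This is a sketch under stated assumptions, not part of the original analysis: `count_boxplot_outliers` is a hypothetical helper built on base R's `boxplot.stats()`, which flags points lying more than 1.5 times the hinge spread beyond the hinges. The toy data frame is made up for illustration.

```r
# Hypothetical helper: number of points outside the boxplot whiskers
# (more than 1.5 * IQR beyond the hinges) in each numeric column.
count_boxplot_outliers <- function(df) {
  num <- df[sapply(df, is.numeric)]
  sapply(num, function(x) length(boxplot.stats(x)$out))
}

# Toy data for illustration only: column a has one extreme value, b has none.
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
                  b = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
count_boxplot_outliers(toy)

# With the wine data loaded as in this report, the share of flagged points:
# count_boxplot_outliers(dfwr) / nrow(dfwr)
```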
#correlations
signif(cor(wr[,colnames(wr)]),3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.25600 0.6720
## volatile.acidity -0.2560 1.00000 -0.5520
## citric.acid 0.6720 -0.55200 1.0000
## residual.sugar 0.1150 0.00192 0.1440
## chlorides 0.0937 0.06130 0.2040
## free.sulfur.dioxide -0.1540 -0.01050 -0.0610
## total.sulfur.dioxide -0.1130 0.07650 0.0355
## density 0.6680 0.02200 0.3650
## pH -0.6830 0.23500 -0.5420
## sulphates 0.1830 -0.26100 0.3130
## alcohol -0.0617 -0.20200 0.1100
## quality 0.1240 -0.39100 0.2260
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11500 0.09370 -0.15400
## volatile.acidity 0.00192 0.06130 -0.01050
## citric.acid 0.14400 0.20400 -0.06100
## residual.sugar 1.00000 0.05560 0.18700
## chlorides 0.05560 1.00000 0.00556
## free.sulfur.dioxide 0.18700 0.00556 1.00000
## total.sulfur.dioxide 0.20300 0.04740 0.66800
## density 0.35500 0.20100 -0.02190
## pH -0.08570 -0.26500 0.07040
## sulphates 0.00553 0.37100 0.05170
## alcohol 0.04210 -0.22100 -0.06940
## quality 0.01370 -0.12900 -0.05070
## total.sulfur.dioxide density pH sulphates
## fixed.acidity -0.1130 0.6680 -0.6830 0.18300
## volatile.acidity 0.0765 0.0220 0.2350 -0.26100
## citric.acid 0.0355 0.3650 -0.5420 0.31300
## residual.sugar 0.2030 0.3550 -0.0857 0.00553
## chlorides 0.0474 0.2010 -0.2650 0.37100
## free.sulfur.dioxide 0.6680 -0.0219 0.0704 0.05170
## total.sulfur.dioxide 1.0000 0.0713 -0.0665 0.04290
## density 0.0713 1.0000 -0.3420 0.14900
## pH -0.0665 -0.3420 1.0000 -0.19700
## sulphates 0.0429 0.1490 -0.1970 1.00000
## alcohol -0.2060 -0.4960 0.2060 0.09360
## quality -0.1850 -0.1750 -0.0577 0.25100
## alcohol quality
## fixed.acidity -0.0617 0.1240
## volatile.acidity -0.2020 -0.3910
## citric.acid 0.1100 0.2260
## residual.sugar 0.0421 0.0137
## chlorides -0.2210 -0.1290
## free.sulfur.dioxide -0.0694 -0.0507
## total.sulfur.dioxide -0.2060 -0.1850
## density -0.4960 -0.1750
## pH 0.2060 -0.0577
## sulphates 0.0936 0.2510
## alcohol 1.0000 0.4760
## quality 0.4760 1.0000
signif(cor(ww[,colnames(ww)]),3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.0227 0.28900
## volatile.acidity -0.0227 1.0000 -0.14900
## citric.acid 0.2890 -0.1490 1.00000
## residual.sugar 0.0890 0.0643 0.09420
## chlorides 0.0231 0.0705 0.11400
## free.sulfur.dioxide -0.0494 -0.0970 0.09410
## total.sulfur.dioxide 0.0911 0.0893 0.12100
## density 0.2650 0.0271 0.15000
## pH -0.4260 -0.0319 -0.16400
## sulphates -0.0171 -0.0357 0.06230
## alcohol -0.1210 0.0677 -0.07570
## quality -0.1140 -0.1950 -0.00921
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.0890 0.0231 -0.049400
## volatile.acidity 0.0643 0.0705 -0.097000
## citric.acid 0.0942 0.1140 0.094100
## residual.sugar 1.0000 0.0887 0.299000
## chlorides 0.0887 1.0000 0.101000
## free.sulfur.dioxide 0.2990 0.1010 1.000000
## total.sulfur.dioxide 0.4010 0.1990 0.616000
## density 0.8390 0.2570 0.294000
## pH -0.1940 -0.0904 -0.000618
## sulphates -0.0267 0.0168 0.059200
## alcohol -0.4510 -0.3600 -0.250000
## quality -0.0976 -0.2100 0.008160
## total.sulfur.dioxide density pH sulphates
## fixed.acidity 0.09110 0.2650 -0.426000 -0.0171
## volatile.acidity 0.08930 0.0271 -0.031900 -0.0357
## citric.acid 0.12100 0.1500 -0.164000 0.0623
## residual.sugar 0.40100 0.8390 -0.194000 -0.0267
## chlorides 0.19900 0.2570 -0.090400 0.0168
## free.sulfur.dioxide 0.61600 0.2940 -0.000618 0.0592
## total.sulfur.dioxide 1.00000 0.5300 0.002320 0.1350
## density 0.53000 1.0000 -0.093600 0.0745
## pH 0.00232 -0.0936 1.000000 0.1560
## sulphates 0.13500 0.0745 0.156000 1.0000
## alcohol -0.44900 -0.7800 0.121000 -0.0174
## quality -0.17500 -0.3070 0.099400 0.0537
## alcohol quality
## fixed.acidity -0.1210 -0.11400
## volatile.acidity 0.0677 -0.19500
## citric.acid -0.0757 -0.00921
## residual.sugar -0.4510 -0.09760
## chlorides -0.3600 -0.21000
## free.sulfur.dioxide -0.2500 0.00816
## total.sulfur.dioxide -0.4490 -0.17500
## density -0.7800 -0.30700
## pH 0.1210 0.09940
## sulphates -0.0174 0.05370
## alcohol 1.0000 0.43600
## quality 0.4360 1.00000
Red wine: 1. Fixed acidity is strongly correlated with citric acid (0.672), which is natural because citric acid is itself an acid. 2. Fixed acidity is also strongly correlated with density (0.668). 3. Fixed acidity is negatively correlated with pH (-0.683), as expected, since more acidic solutions have lower pH values. 4. The variables most strongly correlated with quality are alcohol (0.476) and volatile acidity (-0.391); sulphates (0.251) and citric acid (0.226) show weaker correlations. 5. Alcohol is negatively correlated with density (-0.496).
White wine: 1. Fixed acidity is only weakly correlated with citric acid (0.289). 2. Its correlation with density is also weak (0.265); density is instead strongly correlated with residual sugar (0.839). 3. Fixed acidity is again negatively correlated with pH (-0.426), as expected, since more acidic solutions have lower pH values. 4. The variables most strongly correlated with quality are alcohol (0.436, positive), density (-0.307, negative) and chlorides (-0.210, negative); citric acid (-0.009) and sulphates (0.054) are essentially uncorrelated with quality. 5. Alcohol is strongly negatively correlated with density (-0.780).
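Reading the strongest associations off a 12 by 12 matrix is error-prone, so it may help to rank the predictors by their correlation with the outcome. The following is an illustrative sketch: `rank_by_correlation` is a hypothetical helper, not part of the original analysis, and the toy data frame is made up so that the block runs without the wine files.

```r
# Hypothetical helper: correlations of every predictor with the outcome,
# ordered by absolute value so the strongest associations come first.
rank_by_correlation <- function(df, outcome = "quality") {
  predictors <- setdiff(colnames(df), outcome)
  r <- cor(df[predictors], df[[outcome]])[, 1]
  r[order(abs(r), decreasing = TRUE)]
}

# Toy data for illustration only.
toy <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(5, 3, 4, 1, 2),
                  x3 = c(2, 1, 2, 1, 2),
                  quality = c(1, 2, 3, 4, 5))
rank_by_correlation(toy)

# With the wine data loaded as in this report:
# signif(rank_by_correlation(dfwr), 3)
# signif(rank_by_correlation(dfww), 3)
```

Comparing the two ranked vectors side by side gives a first impression of whether red and white wine quality depend on the same attributes.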
Now creating box plots of red wine attributes against quality to see how they trend
# group = quality draws one box per quality level (quality is numeric);
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
ggplot(data = dfwr, aes(x = quality, y = fixed.acidity, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = volatile.acidity, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = citric.acid, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = residual.sugar, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = chlorides, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = free.sulfur.dioxide, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = total.sulfur.dioxide, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = density, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = pH, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = sulphates, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfwr, aes(x = quality, y = alcohol, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
Now creating box plots of white wine attributes against quality to see how they trend
# group = quality draws one box per quality level (quality is numeric);
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
ggplot(data = dfww, aes(x = quality, y = fixed.acidity, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = volatile.acidity, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = citric.acid, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = residual.sugar, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = chlorides, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = free.sulfur.dioxide, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = total.sulfur.dioxide, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = density, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = pH, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = sulphates, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
ggplot(data = dfww, aes(x = quality, y = alcohol, group = quality)) +
  geom_jitter(alpha = .3) +
  geom_boxplot(alpha = .1, color = 'blue') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)
For red wine:
Fixed acidity has almost no relationship with quality.
Volatile acidity has a negative relationship with quality.
Higher citric acid tends to go with better quality.
Residual sugar shows no relationship with quality.
Chlorides are weakly correlated with quality; lower chloride values go with better wines.
Free SO2 shows almost no linear relationship with quality (correlation -0.05).
Total SO2 is weakly negatively correlated with quality (-0.185), so higher values do not go with better wine.
Density clearly affects quality, though negatively.
pH is only weakly related to quality: lower pH goes with slightly better quality, although quality decreases again at very low pH.
Sulphates and alcohol are both positively correlated with quality; both increase with quality.
For white wine, the attributes behave as in red wine, with the following exceptions:
Volatile acidity has a weaker negative relationship with quality (correlation -0.195, versus -0.391 for red wine).
Citric acid likewise shows no relationship with quality.
pH has only a weak relationship with white wine quality.
Sulphates show no relationship with quality.
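The trends described above can be checked numerically by tabulating the mean of each attribute within each quality level, which is what the red stars in the plots represent. This is a minimal sketch that was not part of the original analysis: `means_by_quality` is a hypothetical helper built on base R's `aggregate()`, and the toy data frame is made up for illustration.

```r
# Hypothetical helper: mean of every predictor within each quality level.
means_by_quality <- function(df, outcome = "quality") {
  aggregate(df[setdiff(colnames(df), outcome)],
            by = setNames(list(df[[outcome]]), outcome),
            FUN = mean)
}

# Toy data for illustration only.
toy <- data.frame(alcohol = c(9, 10, 11, 12),
                  quality = c(5, 5, 6, 6))
means_by_quality(toy)

# With the wine data loaded as in this report:
# signif(means_by_quality(dfwr), 3)
# signif(means_by_quality(dfww), 3)
```

A column whose means rise or fall steadily with quality supports a monotonic relationship, while a flat column supports the "no relationship" reading. Levels with very few wines (for example quality 3 or 9) give noisy means.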
#Pairs of untransformed attributes
pairs(dfwr);
pairs(dfww);
#summary of untransformed linear regression
mwr<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol,dfwr)
summary(mwr)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## pH + sulphates + alcohol, data = dfwr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.67204 -0.36527 -0.04523 0.45628 2.03894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4538341 0.6125783 7.271 5.59e-13 ***
## fixed.acidity 0.0081441 0.0160586 0.507 0.61212
## volatile.acidity -1.0964449 0.1200866 -9.130 < 2e-16 ***
## citric.acid -0.1836098 0.1471561 -1.248 0.21232
## residual.sugar 0.0089507 0.0120542 0.743 0.45787
## chlorides -1.9067341 0.4173928 -4.568 5.30e-06 ***
## free.sulfur.dioxide 0.0045147 0.0021631 2.087 0.03704 *
## total.sulfur.dioxide -0.0033120 0.0007264 -4.560 5.52e-06 ***
## pH -0.5042762 0.1571117 -3.210 0.00136 **
## sulphates 0.8928974 0.1107548 8.062 1.46e-15 ***
## alcohol 0.2927427 0.0173394 16.883 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6479 on 1588 degrees of freedom
## Multiple R-squared: 0.3603, Adjusted R-squared: 0.3562
## F-statistic: 89.43 on 10 and 1588 DF, p-value: < 2.2e-16
mww<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+pH+sulphates+alcohol,dfww)
summary(mww)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## pH + sulphates + alcohol, data = dfww)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9098 -0.4957 -0.0330 0.4666 3.1785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0636371 0.3482321 5.926 3.32e-09 ***
## fixed.acidity -0.0503197 0.0149092 -3.375 0.000744 ***
## volatile.acidity -1.9583442 0.1138553 -17.200 < 2e-16 ***
## citric.acid -0.0289483 0.0961455 -0.301 0.763360
## residual.sugar 0.0256438 0.0025518 10.049 < 2e-16 ***
## chlorides -0.9525303 0.5425208 -1.756 0.079194 .
## free.sulfur.dioxide 0.0047672 0.0008391 5.682 1.41e-08 ***
## total.sulfur.dioxide -0.0008697 0.0003730 -2.331 0.019771 *
## pH 0.1651688 0.0825418 2.001 0.045444 *
## sulphates 0.4193440 0.0973099 4.309 1.67e-05 ***
## alcohol 0.3626941 0.0112672 32.190 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.756 on 4887 degrees of freedom
## Multiple R-squared: 0.2727, Adjusted R-squared: 0.2713
## F-statistic: 183.3 on 10 and 4887 DF, p-value: < 2.2e-16
# Log transformed
# apply log(x + 1) to every column (including quality) and summarize the result
cols <- c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","free.sulfur.dioxide","total.sulfur.dioxide","density","pH","sulphates","alcohol","quality")
logdfwr[cols] <- log(dfwr[cols]+1)
logdfww[cols] <- log(dfww[cols]+1)
summary(logdfwr)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :1.723 Min. :0.1133 Min. :0.00000 Min. :0.6419
## 1st Qu.:2.092 1st Qu.:0.3293 1st Qu.:0.08618 1st Qu.:1.0647
## Median :2.186 Median :0.4187 Median :0.23111 Median :1.1632
## Mean :2.216 Mean :0.4172 Mean :0.22815 Mean :1.2181
## 3rd Qu.:2.322 3rd Qu.:0.4947 3rd Qu.:0.35066 3rd Qu.:1.2809
## Max. :2.827 Max. :0.9478 Max. :0.69315 Max. :2.8034
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01193 Min. :0.6931 Min. :1.946
## 1st Qu.:0.06766 1st Qu.:2.0794 1st Qu.:3.135
## Median :0.07603 Median :2.7081 Median :3.664
## Mean :0.08304 Mean :2.6390 Mean :3.635
## 3rd Qu.:0.08618 3rd Qu.:3.0910 3rd Qu.:4.143
## Max. :0.47686 Max. :4.2905 Max. :5.670
## density pH sulphates alcohol
## Min. :0.6882 Min. :1.319 Min. :0.2852 Min. :2.241
## 1st Qu.:0.6909 1st Qu.:1.437 1st Qu.:0.4383 1st Qu.:2.351
## Median :0.6915 Median :1.461 Median :0.4824 Median :2.416
## Mean :0.6915 Mean :1.461 Mean :0.5011 Mean :2.431
## 3rd Qu.:0.6921 3rd Qu.:1.482 3rd Qu.:0.5481 3rd Qu.:2.493
## Max. :0.6950 Max. :1.611 Max. :1.0986 Max. :2.766
## quality
## Min. :1.386
## 1st Qu.:1.792
## Median :1.946
## Mean :1.885
## 3rd Qu.:1.946
## Max. :2.197
summary(logdfww)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :1.569 Min. :0.07696 Min. :0.0000 Min. :0.4700
## 1st Qu.:1.988 1st Qu.:0.19062 1st Qu.:0.2390 1st Qu.:0.9933
## Median :2.054 Median :0.23111 Median :0.2776 Median :1.8245
## Mean :2.055 Mean :0.24257 Mean :0.2844 Mean :1.7522
## 3rd Qu.:2.116 3rd Qu.:0.27763 3rd Qu.:0.3293 3rd Qu.:2.3888
## Max. :2.721 Max. :0.74194 Max. :0.9783 Max. :4.2017
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00896 Min. :1.099 Min. :2.303
## 1st Qu.:0.03537 1st Qu.:3.178 1st Qu.:4.691
## Median :0.04210 Median :3.555 Median :4.905
## Mean :0.04455 Mean :3.472 Mean :4.886
## 3rd Qu.:0.04879 3rd Qu.:3.850 3rd Qu.:5.124
## Max. :0.29714 Max. :5.670 Max. :6.089
## density pH sulphates alcohol
## Min. :0.6867 Min. :1.314 Min. :0.1989 Min. :2.197
## 1st Qu.:0.6890 1st Qu.:1.409 1st Qu.:0.3436 1st Qu.:2.351
## Median :0.6900 Median :1.430 Median :0.3853 Median :2.434
## Mean :0.6902 Mean :1.432 Mean :0.3959 Mean :2.438
## 3rd Qu.:0.6912 3rd Qu.:1.454 3rd Qu.:0.4383 3rd Qu.:2.518
## Max. :0.7124 Max. :1.573 Max. :0.7324 Max. :2.721
## quality
## Min. :1.386
## 1st Qu.:1.792
## Median :1.946
## Mean :1.920
## 3rd Qu.:1.946
## Max. :2.303
#correlations
signif(cor(logdfwr[,colnames(wr)]),3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.2610 0.6620
## volatile.acidity -0.261 1.0000 -0.5750
## citric.acid 0.662 -0.5750 1.0000
## residual.sugar 0.159 0.0242 0.1640
## chlorides 0.120 0.0726 0.1890
## free.sulfur.dioxide -0.178 0.0207 -0.0796
## total.sulfur.dioxide -0.114 0.0841 0.0128
## density 0.674 0.0300 0.3590
## pH -0.704 0.2320 -0.5440
## sulphates 0.191 -0.2830 0.3200
## alcohol -0.090 -0.2140 0.0997
## quality 0.113 -0.3960 0.2200
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.1590 0.12000 -0.17800
## volatile.acidity 0.0242 0.07260 0.02070
## citric.acid 0.1640 0.18900 -0.07960
## residual.sugar 1.0000 0.05590 0.10000
## chlorides 0.0559 1.00000 -0.00557
## free.sulfur.dioxide 0.1000 -0.00557 1.00000
## total.sulfur.dioxide 0.1540 0.06220 0.78400
## density 0.4060 0.21900 -0.03960
## pH -0.0896 -0.27300 0.09580
## sulphates 0.0156 0.33800 0.05530
## alcohol 0.0751 -0.23600 -0.08320
## quality 0.0173 -0.13400 -0.03870
## total.sulfur.dioxide density pH sulphates
## fixed.acidity -0.1140 0.6740 -0.7040 0.1910
## volatile.acidity 0.0841 0.0300 0.2320 -0.2830
## citric.acid 0.0128 0.3590 -0.5440 0.3200
## residual.sugar 0.1540 0.4060 -0.0896 0.0156
## chlorides 0.0622 0.2190 -0.2730 0.3380
## free.sulfur.dioxide 0.7840 -0.0396 0.0958 0.0553
## total.sulfur.dioxide 1.0000 0.1040 -0.0171 0.0593
## density 0.1040 1.0000 -0.3410 0.1570
## pH -0.0171 -0.3410 1.0000 -0.1840
## sulphates 0.0593 0.1570 -0.1840 1.0000
## alcohol -0.2370 -0.4920 0.2030 0.1150
## quality -0.1550 -0.1670 -0.0603 0.2760
## alcohol quality
## fixed.acidity -0.0900 0.1130
## volatile.acidity -0.2140 -0.3960
## citric.acid 0.0997 0.2200
## residual.sugar 0.0751 0.0173
## chlorides -0.2360 -0.1340
## free.sulfur.dioxide -0.0832 -0.0387
## total.sulfur.dioxide -0.2370 -0.1550
## density -0.4920 -0.1670
## pH 0.2030 -0.0603
## sulphates 0.1150 0.2760
## alcohol 1.0000 0.4590
## quality 0.4590 1.0000
signif(cor(logdfww[,colnames(ww)]),3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.0306 0.3040
## volatile.acidity -0.0306 1.0000 -0.1710
## citric.acid 0.3040 -0.1710 1.0000
## residual.sugar 0.0874 0.0925 0.0710
## chlorides 0.0341 0.0682 0.1070
## free.sulfur.dioxide -0.0465 -0.1130 0.0869
## total.sulfur.dioxide 0.0849 0.0719 0.1150
## density 0.2760 0.0253 0.1460
## pH -0.4350 -0.0346 -0.1660
## sulphates -0.0153 -0.0373 0.0672
## alcohol -0.1250 0.0577 -0.0689
## quality -0.1120 -0.2090 0.0100
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.0874 0.0341 -0.0465
## volatile.acidity 0.0925 0.0682 -0.1130
## citric.acid 0.0710 0.1070 0.0869
## residual.sugar 1.0000 0.0836 0.3260
## chlorides 0.0836 1.0000 0.0957
## free.sulfur.dioxide 0.3260 0.0957 1.0000
## total.sulfur.dioxide 0.4090 0.2040 0.6290
## density 0.7780 0.2690 0.2850
## pH -0.1840 -0.0906 0.0217
## sulphates -0.0324 0.0237 0.0631
## alcohol -0.4250 -0.3750 -0.2310
## quality -0.0686 -0.2120 0.1050
## total.sulfur.dioxide density pH sulphates
## fixed.acidity 0.0849 0.2760 -0.4350 -0.0153
## volatile.acidity 0.0719 0.0253 -0.0346 -0.0373
## citric.acid 0.1150 0.1460 -0.1660 0.0672
## residual.sugar 0.4090 0.7780 -0.1840 -0.0324
## chlorides 0.2040 0.2690 -0.0906 0.0237
## free.sulfur.dioxide 0.6290 0.2850 0.0217 0.0631
## total.sulfur.dioxide 1.0000 0.5060 0.0179 0.1410
## density 0.5060 1.0000 -0.0948 0.0823
## pH 0.0179 -0.0948 1.0000 0.1580
## sulphates 0.1410 0.0823 0.1580 1.0000
## alcohol -0.4300 -0.7860 0.1290 -0.0252
## quality -0.1160 -0.2980 0.0957 0.0500
## alcohol quality
## fixed.acidity -0.1250 -0.1120
## volatile.acidity 0.0577 -0.2090
## citric.acid -0.0689 0.0100
## residual.sugar -0.4250 -0.0686
## chlorides -0.3750 -0.2120
## free.sulfur.dioxide -0.2310 0.1050
## total.sulfur.dioxide -0.4300 -0.1160
## density -0.7860 -0.2980
## pH 0.1290 0.0957
## sulphates -0.0252 0.0500
## alcohol 1.0000 0.4200
## quality 0.4200 1.0000
mwr<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,logdfwr)
summary(mwr)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = logdfwr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51177 -0.05083 -0.00499 0.06926 0.27889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.327371 4.749294 1.122 0.26215
## fixed.acidity 0.058952 0.039964 1.475 0.14038
## volatile.acidity -0.274983 0.029618 -9.284 < 2e-16 ***
## citric.acid -0.063666 0.028694 -2.219 0.02664 *
## residual.sugar 0.010098 0.012707 0.795 0.42694
## chlorides -0.315844 0.076437 -4.132 3.78e-05 ***
## free.sulfur.dioxide 0.016154 0.006785 2.381 0.01738 *
## total.sulfur.dioxide -0.020126 0.006523 -3.085 0.00207 **
## density -6.251071 7.025402 -0.890 0.37372
## pH -0.228828 0.131154 -1.745 0.08123 .
## sulphates 0.267088 0.032067 8.329 < 2e-16 ***
## alcohol 0.462217 0.049030 9.427 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09955 on 1587 degrees of freedom
## Multiple R-squared: 0.3468, Adjusted R-squared: 0.3423
## F-statistic: 76.61 on 11 and 1587 DF, p-value: < 2.2e-16
mww<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,logdfww)
summary(mww)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = logdfww)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62344 -0.06828 0.00104 0.07211 0.47466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.067607 2.509740 5.207 2.00e-07 ***
## fixed.acidity -0.003205 0.020846 -0.154 0.87783
## volatile.acidity -0.379659 0.022403 -16.947 < 2e-16 ***
## citric.acid 0.014139 0.019555 0.723 0.46970
## residual.sugar 0.045812 0.004989 9.183 < 2e-16 ***
## chlorides -0.154266 0.087524 -1.763 0.07804 .
## free.sulfur.dioxide 0.043017 0.004095 10.506 < 2e-16 ***
## total.sulfur.dioxide -0.018855 0.007018 -2.686 0.00725 **
## density -18.323801 3.642544 -5.030 5.07e-07 ***
## pH 0.184960 0.057225 3.232 0.00124 **
## sulphates 0.113451 0.022213 5.107 3.39e-07 ***
## alcohol 0.472880 0.033085 14.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1102 on 4886 degrees of freedom
## Multiple R-squared: 0.288, Adjusted R-squared: 0.2864
## F-statistic: 179.7 on 11 and 4886 DF, p-value: < 2.2e-16
#Pairs of log transformed
pairs(logdfwr);
pairs(logdfww);
# Square-root transformed: apply sqrt(x + 1) to every column
cols <- c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","free.sulfur.dioxide","total.sulfur.dioxide","density","pH","sulphates","alcohol","quality")
sqrtdfwr[cols] <- sqrt(dfwr[cols]+1)
sqrtdfww[cols] <- sqrt(dfww[cols]+1)
summary(sqrtdfwr)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :2.366 Min. :1.058 Min. :1.000 Min. :1.378
## 1st Qu.:2.846 1st Qu.:1.179 1st Qu.:1.044 1st Qu.:1.703
## Median :2.983 Median :1.233 Median :1.122 Median :1.789
## Mean :3.040 Mean :1.234 Mean :1.124 Mean :1.857
## 3rd Qu.:3.194 3rd Qu.:1.281 3rd Qu.:1.192 3rd Qu.:1.897
## Max. :4.111 Max. :1.606 Max. :1.414 Max. :4.062
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :1.006 Min. :1.414 Min. : 2.646 Min. :1.411
## 1st Qu.:1.034 1st Qu.:2.828 1st Qu.: 4.796 1st Qu.:1.413
## Median :1.039 Median :3.873 Median : 6.245 Median :1.413
## Mean :1.043 Mean :3.925 Mean : 6.521 Mean :1.413
## 3rd Qu.:1.044 3rd Qu.:4.690 3rd Qu.: 7.937 3rd Qu.:1.413
## Max. :1.269 Max. :8.544 Max. :17.029 Max. :1.416
## pH sulphates alcohol quality
## Min. :1.934 Min. :1.153 Min. :3.066 Min. :2.000
## 1st Qu.:2.052 1st Qu.:1.245 1st Qu.:3.240 1st Qu.:2.449
## Median :2.076 Median :1.273 Median :3.347 Median :2.646
## Mean :2.076 Mean :1.286 Mean :3.376 Mean :2.571
## 3rd Qu.:2.098 3rd Qu.:1.315 3rd Qu.:3.479 3rd Qu.:2.646
## Max. :2.238 Max. :1.732 Max. :3.987 Max. :3.000
summary(sqrtdfww)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :2.191 Min. :1.039 Min. :1.000 Min. :1.265
## 1st Qu.:2.702 1st Qu.:1.100 1st Qu.:1.127 1st Qu.:1.643
## Median :2.793 Median :1.122 Median :1.149 Median :2.490
## Mean :2.799 Mean :1.130 Mean :1.154 Mean :2.561
## 3rd Qu.:2.881 3rd Qu.:1.149 3rd Qu.:1.179 3rd Qu.:3.302
## Max. :3.899 Max. :1.449 Max. :1.631 Max. :8.173
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :1.004 Min. : 1.732 Min. : 3.162 Min. :1.410
## 1st Qu.:1.018 1st Qu.: 4.899 1st Qu.:10.440 1st Qu.:1.411
## Median :1.021 Median : 5.916 Median :11.619 Median :1.412
## Mean :1.023 Mean : 5.859 Mean :11.662 Mean :1.412
## 3rd Qu.:1.025 3rd Qu.: 6.856 3rd Qu.:12.961 3rd Qu.:1.413
## Max. :1.160 Max. :17.029 Max. :21.000 Max. :1.428
## pH sulphates alcohol quality
## Min. :1.929 Min. :1.105 Min. :3.000 Min. :2.000
## 1st Qu.:2.022 1st Qu.:1.187 1st Qu.:3.240 1st Qu.:2.449
## Median :2.045 Median :1.212 Median :3.376 Median :2.646
## Mean :2.046 Mean :1.220 Mean :3.389 Mean :2.617
## 3rd Qu.:2.069 3rd Qu.:1.245 3rd Qu.:3.521 3rd Qu.:2.646
## Max. :2.195 Max. :1.442 Max. :3.899 Max. :3.162
#correlations
signif(cor(sqrtdfwr[,colnames(wr)]),3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.2600 0.6680
## volatile.acidity -0.2600 1.0000 -0.5650
## citric.acid 0.6680 -0.5650 1.0000
## residual.sugar 0.1380 0.0135 0.1560
## chlorides 0.1070 0.0670 0.1960
## free.sulfur.dioxide -0.1690 0.0031 -0.0711
## total.sulfur.dioxide -0.1160 0.0822 0.0240
## density 0.6720 0.0261 0.3620
## pH -0.6950 0.2340 -0.5440
## sulphates 0.1890 -0.2730 0.3170
## alcohol -0.0756 -0.2080 0.1050
## quality 0.1190 -0.3950 0.2240
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.1380 0.10700 -0.16900
## volatile.acidity 0.0135 0.06700 0.00310
## citric.acid 0.1560 0.19600 -0.07110
## residual.sugar 1.0000 0.05510 0.13900
## chlorides 0.0551 1.00000 -0.00108
## free.sulfur.dioxide 0.1390 -0.00108 1.00000
## total.sulfur.dioxide 0.1810 0.05730 0.73700
## density 0.3830 0.21000 -0.03300
## pH -0.0888 -0.26900 0.08490
## sulphates 0.0104 0.35500 0.05490
## alcohol 0.0606 -0.22900 -0.07650
## quality 0.0159 -0.13100 -0.04650
## total.sulfur.dioxide density pH sulphates
## fixed.acidity -0.1160 0.6720 -0.6950 0.1890
## volatile.acidity 0.0822 0.0261 0.2340 -0.2730
## citric.acid 0.0240 0.3620 -0.5440 0.3170
## residual.sugar 0.1810 0.3830 -0.0888 0.0104
## chlorides 0.0573 0.2100 -0.2690 0.3550
## free.sulfur.dioxide 0.7370 -0.0330 0.0849 0.0549
## total.sulfur.dioxide 1.0000 0.0894 -0.0412 0.0490
## density 0.0894 1.0000 -0.3410 0.1530
## pH -0.0412 -0.3410 1.0000 -0.1900
## sulphates 0.0490 0.1530 -0.1900 1.0000
## alcohol -0.2270 -0.4940 0.2040 0.1050
## quality -0.1780 -0.1710 -0.0590 0.2650
## alcohol quality
## fixed.acidity -0.0756 0.1190
## volatile.acidity -0.2080 -0.3950
## citric.acid 0.1050 0.2240
## residual.sugar 0.0606 0.0159
## chlorides -0.2290 -0.1310
## free.sulfur.dioxide -0.0765 -0.0465
## total.sulfur.dioxide -0.2270 -0.1780
## density -0.4940 -0.1710
## pH 0.2040 -0.0590
## sulphates 0.1050 0.2650
## alcohol 1.0000 0.4680
## quality 0.4680 1.0000
signif(cor(sqrtdfww[,colnames(ww)]),3)
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0000 -0.0269 0.297000
## volatile.acidity -0.0269 1.0000 -0.161000
## citric.acid 0.2970 -0.1610 1.000000
## residual.sugar 0.0897 0.0763 0.083700
## chlorides 0.0285 0.0694 0.111000
## free.sulfur.dioxide -0.0484 -0.1070 0.094600
## total.sulfur.dioxide 0.0893 0.0827 0.119000
## density 0.2710 0.0261 0.148000
## pH -0.4310 -0.0333 -0.165000
## sulphates -0.0164 -0.0367 0.064900
## alcohol -0.1230 0.0628 -0.072600
## quality -0.1130 -0.2020 0.000202
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.0897 0.0285 -0.04840
## volatile.acidity 0.0763 0.0694 -0.10700
## citric.acid 0.0837 0.1110 0.09460
## residual.sugar 1.0000 0.0875 0.32800
## chlorides 0.0875 1.0000 0.10100
## free.sulfur.dioxide 0.3280 0.1010 1.00000
## total.sulfur.dioxide 0.4170 0.2040 0.62800
## density 0.8160 0.2630 0.29900
## pH -0.1930 -0.0905 0.00855
## sulphates -0.0316 0.0202 0.06040
## alcohol -0.4470 -0.3680 -0.24800
## quality -0.0861 -0.2120 0.05440
## total.sulfur.dioxide density pH sulphates
## fixed.acidity 0.08930 0.2710 -0.43100 -0.0164
## volatile.acidity 0.08270 0.0261 -0.03330 -0.0367
## citric.acid 0.11900 0.1480 -0.16500 0.0649
## residual.sugar 0.41700 0.8160 -0.19300 -0.0316
## chlorides 0.20400 0.2630 -0.09050 0.0202
## free.sulfur.dioxide 0.62800 0.2990 0.00855 0.0604
## total.sulfur.dioxide 1.00000 0.5260 0.00955 0.1380
## density 0.52600 1.0000 -0.09420 0.0784
## pH 0.00955 -0.0942 1.00000 0.1570
## sulphates 0.13800 0.0784 0.15700 1.0000
## alcohol -0.44600 -0.7830 0.12500 -0.0214
## quality -0.15000 -0.3030 0.09780 0.0518
## alcohol quality
## fixed.acidity -0.1230 -0.113000
## volatile.acidity 0.0628 -0.202000
## citric.acid -0.0726 0.000202
## residual.sugar -0.4470 -0.086100
## chlorides -0.3680 -0.212000
## free.sulfur.dioxide -0.2480 0.054400
## total.sulfur.dioxide -0.4460 -0.150000
## density -0.7830 -0.303000
## pH 0.1250 0.097800
## sulphates -0.0214 0.051800
## alcohol 1.0000 0.429000
## quality 0.4290 1.000000
mwr<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,sqrtdfwr)
summary(mwr)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = sqrtdfwr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58663 -0.06835 -0.00742 0.08739 0.36943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.656340 17.363964 1.017 0.3094
## fixed.acidity 0.038927 0.032800 1.187 0.2355
## volatile.acidity -0.544648 0.059954 -9.084 < 2e-16 ***
## citric.acid -0.111235 0.065160 -1.707 0.0880 .
## residual.sugar 0.013494 0.014351 0.940 0.3472
## chlorides -0.767905 0.178518 -4.302 1.80e-05 ***
## free.sulfur.dioxide 0.009786 0.004049 2.417 0.0158 *
## total.sulfur.dioxide -0.009285 0.002331 -3.984 7.09e-05 ***
## density -10.477681 12.461518 -0.841 0.4006
## pH -0.308713 0.159589 -1.934 0.0532 .
## sulphates 0.497197 0.060570 8.209 4.58e-16 ***
## alcohol 0.354957 0.036287 9.782 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1264 on 1587 degrees of freedom
## Multiple R-squared: 0.3557, Adjusted R-squared: 0.3512
## F-statistic: 79.63 on 11 and 1587 DF, p-value: < 2.2e-16
mww<-lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,sqrtdfww)
summary(mww)
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = sqrtdfww)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76917 -0.09107 -0.00360 0.09041 0.69875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.120546 11.473447 6.809 1.10e-11 ***
## fixed.acidity 0.031764 0.020585 1.543 0.123
## volatile.acidity -0.845340 0.050416 -16.767 < 2e-16 ***
## citric.acid 0.019880 0.043267 0.459 0.646
## residual.sugar 0.065247 0.006379 10.228 < 2e-16 ***
## chlorides -0.228175 0.217964 -1.047 0.295
## free.sulfur.dioxide 0.014719 0.001979 7.438 1.20e-13 ***
## total.sulfur.dioxide -0.003540 0.001679 -2.109 0.035 *
## density -54.410476 8.197226 -6.638 3.53e-11 ***
## pH 0.385046 0.076778 5.015 5.49e-07 ***
## sulphates 0.272193 0.047097 5.779 7.96e-09 ***
## alcohol 0.316998 0.027597 11.487 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1434 on 4886 degrees of freedom
## Multiple R-squared: 0.2851, Adjusted R-squared: 0.2835
## F-statistic: 177.1 on 11 and 4886 DF, p-value: < 2.2e-16
# Pairs plots of the square-root-transformed data
pairs(sqrtdfwr)
pairs(sqrtdfww)
The linear regression summaries for the untransformed, log-transformed and square-root-transformed data have almost the same R^2, but the residual standard error (RSE) is lowest for the log-transformed dataset, so we continue with the log-transformed data. Note that RSE is measured in the units of the (transformed) response, so this comparison across transformations is only indicative. Based on the correlations with quality, the following attributes can be considered potential predictors for red wine, in decreasing order of absolute correlation: 1. alcohol 2. volatile.acidity 3. sulphates 4. citric.acid
For white wine, the potential predictors are: 1. volatile.acidity 2. chlorides 3. alcohol
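The ordering of candidate predictors above can be reproduced programmatically instead of being read off the printed correlation matrix. The following is a minimal sketch, assuming a small synthetic data frame (`toy`, a hypothetical stand-in for `logdfwr`) with made-up coefficients. It ranks predictors by the absolute value of their correlation with `quality`.

```r
# Synthetic stand-in for logdfwr; column values and coefficients are hypothetical.
set.seed(1)
n <- 200
toy <- data.frame(
  alcohol          = rnorm(n),
  volatile.acidity = rnorm(n),
  residual.sugar   = rnorm(n)
)
toy$quality <- 0.8 * toy$alcohol - 0.5 * toy$volatile.acidity + rnorm(n)

# Correlation of each predictor with the response.
predictors <- setdiff(names(toy), "quality")
r <- sapply(toy[predictors], function(x) cor(x, toy$quality))

# Sort by absolute value, strongest first; the sign is kept for interpretation.
ranked <- r[order(abs(r), decreasing = TRUE)]
print(signif(ranked, 3))
```

Applied to the real data, the same three lines after the data setup would be run with `logdfwr` or `logdfww` in place of `toy`.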
Regarding pairwise relationships between predictors, we see a strong negative correlation between pH and fixed.acidity. Similarly, there is a strong positive correlation between total.sulfur.dioxide and free.sulfur.dioxide.
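Strongly correlated predictor pairs such as these can also be listed automatically. The sketch below assumes synthetic data (`toy`, hypothetical) and an arbitrary cutoff of 0.6 on the absolute correlation, which is an illustrative choice and not a value taken from the analysis above.

```r
# Synthetic data in which pH is built to track fixed.acidity (hypothetical values).
set.seed(2)
n <- 200
fixed.acidity <- rnorm(n)
toy <- data.frame(
  fixed.acidity = fixed.acidity,
  pH            = -0.9 * fixed.acidity + rnorm(n, sd = 0.4),
  sulphates     = rnorm(n)
)

cm <- cor(toy)
cutoff <- 0.6  # assumed threshold for calling a correlation "strong"

# Look only at the upper triangle so that each pair is reported once.
idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = signif(cm[idx], 3)
)
print(pairs_tbl)
```

Pairs flagged this way are candidates for collinearity, which makes individual regression coefficients harder to interpret.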
To show the relationship between quality and the predictors listed above, we draw a scatterplot with a fitted linear trend line for each of them.
# Red wine: scatterplots of quality against alcohol, sulphates and volatile.acidity
ggplot(logdfwr,aes(x=quality,y=alcohol)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
ggplot(logdfwr,aes(x=quality,y=sulphates)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
ggplot(logdfwr,aes(x=quality,y=volatile.acidity)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
# White wine: scatterplots of quality against volatile.acidity, alcohol and chlorides
ggplot(logdfww,aes(x=quality,y=volatile.acidity)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
ggplot(logdfww,aes(x=quality,y=alcohol)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
ggplot(logdfww,aes(x=quality,y=chlorides)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
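The untransformed, log and square-root fits compared earlier report RSE on different response scales. One way to put them on a common footing is to back-transform the fitted values and compute the error in original quality units. The following is a minimal sketch under stated assumptions: the data are synthetic (`toy`, with hypothetical columns `x` and `y`), only the response is transformed (the analysis above also transforms the predictors), and the back-transformation is the naive inverse, with no bias correction.

```r
# Synthetic data: one predictor and an integer-valued score (hypothetical).
set.seed(3)
n <- 300
x <- runif(n, 8, 14)
y <- round(pmin(pmax(0.5 * x + rnorm(n), 3), 9))
toy <- data.frame(x = x, y = y)

# The same model on three response scales.
fit_raw  <- lm(y ~ x, data = toy)
fit_log  <- lm(log(y + 1) ~ x, data = toy)
fit_sqrt <- lm(sqrt(y + 1) ~ x, data = toy)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Invert each transformation so every error is measured in units of y.
res <- c(
  raw  = rmse(toy$y, fitted(fit_raw)),
  log  = rmse(toy$y, exp(fitted(fit_log)) - 1),
  sqrt = rmse(toy$y, fitted(fit_sqrt)^2 - 1)
)
print(signif(res, 3))
```

If the three values are close, the choice of transformation matters little for predictive accuracy and can be made on other grounds, such as the shape of the residuals.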
Use regsubsets from library leaps to choose optimal set of variables for modeling wine quality for red and white wine (separately), describe differences and similarities between attributes deemed important in each case.
#Redwine
summary(regsubsets(quality ~ .,logdfwr,method="exhaustive"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "exhaustive")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " " "
## 5 ( 1 ) " " "*" " " " "
## 6 ( 1 ) " " "*" " " " "
## 7 ( 1 ) " " "*" " " " "
## 8 ( 1 ) " " "*" "*" " "
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " "*"
## 6 ( 1 ) "*" " " "*" " " "*"
## 7 ( 1 ) "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" " " "*"
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
summary(regsubsets(quality ~ .,logdfwr,method="backward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "backward")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " " "
## 5 ( 1 ) " " "*" " " " "
## 6 ( 1 ) " " "*" " " " "
## 7 ( 1 ) " " "*" " " " "
## 8 ( 1 ) " " "*" "*" " "
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " "*"
## 6 ( 1 ) "*" " " "*" " " "*"
## 7 ( 1 ) "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" " " "*"
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
summary(regsubsets(quality ~ . ,logdfwr,method="forward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "forward")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " " "
## 5 ( 1 ) " " "*" " " " "
## 6 ( 1 ) " " "*" " " " "
## 7 ( 1 ) " " "*" " " " "
## 8 ( 1 ) " " "*" "*" " "
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " "*"
## 6 ( 1 ) "*" " " "*" " " "*"
## 7 ( 1 ) "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" " " "*"
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
summary(regsubsets(quality ~ .,logdfwr,method="seqrep"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfwr, method = "seqrep")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " " "
## 5 ( 1 ) " " "*" " " " "
## 6 ( 1 ) " " "*" " " " "
## 7 ( 1 ) " " "*" " " " "
## 8 ( 1 ) "*" "*" "*" "*"
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " "
## 5 ( 1 ) "*" " " " " " " "*"
## 6 ( 1 ) "*" " " "*" " " "*"
## 7 ( 1 ) "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" " "
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) "*" "*"
## 4 ( 1 ) "*" "*"
## 5 ( 1 ) "*" "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) " " " "
summary(regsubsets(quality ~ .,logdfwr,method="seqrep"))$which
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE FALSE FALSE
## 3 TRUE FALSE TRUE FALSE FALSE
## 4 TRUE FALSE TRUE FALSE FALSE
## 5 TRUE FALSE TRUE FALSE FALSE
## 6 TRUE FALSE TRUE FALSE FALSE
## 7 TRUE FALSE TRUE FALSE FALSE
## 8 TRUE TRUE TRUE TRUE TRUE
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 TRUE FALSE FALSE FALSE FALSE
## 5 TRUE FALSE FALSE FALSE TRUE
## 6 TRUE FALSE TRUE FALSE TRUE
## 7 TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE FALSE
## sulphates alcohol
## 1 FALSE TRUE
## 2 FALSE TRUE
## 3 TRUE TRUE
## 4 TRUE TRUE
## 5 TRUE TRUE
## 6 TRUE TRUE
## 7 TRUE TRUE
## 8 FALSE FALSE
plot(regsubsets(quality ~ .,logdfwr))
summaryMetrics <- NULL
whichAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward") ) {
rsRes <- regsubsets(quality~.,logdfwr,method=myMthd,nvmax=11)
summRes <- summary(rsRes)
whichAll[[myMthd]] <- summRes$which
for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
summaryMetrics <- rbind(summaryMetrics,
data.frame(method=myMthd,metric=metricName,
nvars=1:length(summRes[[metricName]]),
value=summRes[[metricName]]))
}
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
image(1:nrow(whichAll[[myMthd]]),
1:ncol(whichAll[[myMthd]]),
whichAll[[myMthd]],xlab="N(vars)",ylab="",
xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
col=c("white","gray"),main=myMthd)
axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)  # restore the previous graphical parameters
#white wine
summary(regsubsets(quality ~ .,logdfww,method="exhaustive"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "exhaustive")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " "*"
## 5 ( 1 ) " " "*" " " "*"
## 6 ( 1 ) " " "*" " " "*"
## 7 ( 1 ) " " "*" " " "*"
## 8 ( 1 ) " " "*" " " "*"
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " "*" " "
## 6 ( 1 ) " " "*" " " "*" " "
## 7 ( 1 ) " " "*" " " "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*"
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
summary(regsubsets(quality ~ .,logdfww,method="backward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "backward")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " "*"
## 5 ( 1 ) " " "*" " " "*"
## 6 ( 1 ) " " "*" " " "*"
## 7 ( 1 ) " " "*" " " "*"
## 8 ( 1 ) " " "*" " " "*"
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " "*" " "
## 6 ( 1 ) " " "*" " " "*" " "
## 7 ( 1 ) " " "*" " " "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*"
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
summary(regsubsets(quality ~ . ,logdfww,method="forward"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "forward")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " "*"
## 5 ( 1 ) " " "*" " " "*"
## 6 ( 1 ) " " "*" " " "*"
## 7 ( 1 ) " " "*" " " "*"
## 8 ( 1 ) " " "*" " " "*"
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " "*" " "
## 6 ( 1 ) " " "*" " " "*" " "
## 7 ( 1 ) " " "*" " " "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*"
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
summary(regsubsets(quality ~ .,logdfww,method="seqrep"))
## Subset selection object
## Call: regsubsets.formula(quality ~ ., logdfww, method = "seqrep")
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " "*"
## 5 ( 1 ) " " "*" " " "*"
## 6 ( 1 ) " " "*" " " "*"
## 7 ( 1 ) " " "*" " " "*"
## 8 ( 1 ) " " "*" " " "*"
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " "*" " "
## 6 ( 1 ) " " "*" " " "*" " "
## 7 ( 1 ) " " "*" " " "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*"
## sulphates alcohol
## 1 ( 1 ) " " "*"
## 2 ( 1 ) " " "*"
## 3 ( 1 ) " " "*"
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) "*" "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
summary(regsubsets(quality ~ .,logdfww,method="seqrep"))$which
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE FALSE FALSE
## 3 TRUE FALSE TRUE FALSE FALSE
## 4 TRUE FALSE TRUE FALSE TRUE
## 5 TRUE FALSE TRUE FALSE TRUE
## 6 TRUE FALSE TRUE FALSE TRUE
## 7 TRUE FALSE TRUE FALSE TRUE
## 8 TRUE FALSE TRUE FALSE TRUE
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE TRUE FALSE FALSE FALSE
## 4 FALSE TRUE FALSE FALSE FALSE
## 5 FALSE TRUE FALSE TRUE FALSE
## 6 FALSE TRUE FALSE TRUE FALSE
## 7 FALSE TRUE FALSE TRUE TRUE
## 8 FALSE TRUE TRUE TRUE TRUE
## sulphates alcohol
## 1 FALSE TRUE
## 2 FALSE TRUE
## 3 FALSE TRUE
## 4 FALSE TRUE
## 5 FALSE TRUE
## 6 TRUE TRUE
## 7 TRUE TRUE
## 8 TRUE TRUE
plot(regsubsets(quality ~ .,logdfww))
summaryMetrics <- NULL
whichAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward") ) {
rsRes <- regsubsets(quality~.,logdfww,method=myMthd,nvmax=11)
summRes <- summary(rsRes)
whichAll[[myMthd]] <- summRes$which
for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
summaryMetrics <- rbind(summaryMetrics,
data.frame(method=myMthd,metric=metricName,
nvars=1:length(summRes[[metricName]]),
value=summRes[[metricName]]))
}
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
image(1:nrow(whichAll[[myMthd]]),
1:ncol(whichAll[[myMthd]]),
whichAll[[myMthd]],xlab="N(vars)",ylab="",
xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
col=c("white","gray"),main=myMthd)
axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)  # restore the graphics settings saved above
All three selection methods (exhaustive, backward, forward) perform essentially the same at every model size up to 11 variables. The only metric that worsens for the largest models is BIC, which rises slightly.
For red wine, the plots above and the which attribute of the summary suggest that about 6 variables are optimal. Three of them (alcohol, volatile.acidity, sulphates) matter more than the other three (chlorides, pH, total.sulfur.dioxide). Beyond about 6 variables the metric curves flatten into a nearly straight line.
The 6 variables for red wine are:
alcohol - This was expected throughout the analysis, starting from the subproblem above. It seems plausible, since alcohol is a large part of why people buy wine; in these data, more alcohol goes with higher quality.
sulphates - These contribute to the sulfur dioxide (SO2) level, which is also captured by the total.sulfur.dioxide variable. SO2 protects the wine by acting as an antimicrobial and antioxidant. There are many misconceptions about how much of it is optimal in wine.
volatile.acidity - This comes from acetic acid produced by bacteria in the wine. Acidity is directly related to pH, and pH is also one of the selected variables.
chlorides - This variable did not come up in the analysis in the subproblem above. According to the literature (http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0101-20612015000100095), red wine in particular contains chloride, which gives the wine a salty taste.
For white wine, there are again 6 optimal variables in total, of which the 3 most important are alcohol, volatile.acidity and free.sulfur.dioxide, followed by residual.sugar, density and sulphates.
alcohol, sulphates, volatile.acidity - These are described above for red wine, and the same reasoning holds for white wine.
residual.sugar - Total sulfur dioxide and residual sugar are positively correlated, and the correlation is stronger for white wine. White wine density and residual sugar are also positively correlated, while the alcohol level of white wine decreases as residual sugar increases.
free.sulfur.dioxide - This can be explained by the involvement of the sulphates variable.
density - This is unexpected given the analysis in the subproblem above. Density appears to be correlated with residual sugar and with alcohol, which in turn determine quality.
Use cross-validation (or any other resampling strategy of your choice) to estimate test error for models with different numbers of variables. Compare and comment on the number of variables deemed optimal by resampling versus those selected by regsubsets in the previous task. Compare resulting models built separately for red and white wine data.
#red wine
predict.regsubsets <- function (object, newdata, id, ...){
form=as.formula(object$call [[2]])
mat=model.matrix(form,newdata)
coefi=coef(object,id=id)
xvars=names (coefi)
mat[,xvars] %*% coefi
}
dfTmp <- NULL
whichSum <- array(0,dim=c(11,12,4),
dimnames=list(NULL,colnames(model.matrix(quality ~ .,logdfwr)),
c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test nTries (30) times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(logdfwr)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
rsTrain <- regsubsets(quality ~ .,logdfwr[bTrain,],method=jSelect,nvmax=11)
# Add up variable selections:
whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
for ( kVarSet in 1:11 ) {
# make predictions:
testPred <- predict(rsTrain,logdfwr[!bTrain,],id=kVarSet)
# calculate MSE:
mseTest <- mean((testPred-logdfwr[!bTrain,"quality"])^2)
# add to data.frame for future plotting:
dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
}
}
}
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)
## k-fold cross-validation (10-fold)
# predict.regsubsets() defined above is used for the fold predictions
# First run best subset selection on the full data set and show the coefficients of the 11-variable model:
regfit.best=regsubsets(quality~.,data=logdfwr,nvmax=12,really.big=TRUE)
coef(regfit.best ,11)
## (Intercept) fixed.acidity volatile.acidity
## 5.32737071 0.05895234 -0.27498320
## citric.acid residual.sugar chlorides
## -0.06366609 0.01009791 -0.31584382
## free.sulfur.dioxide total.sulfur.dioxide density
## 0.01615433 -0.02012607 -6.25107126
## pH sulphates alcohol
## -0.22882822 0.26708755 0.46221731
#partitions
k=10
set.seed(1)
folds=sample(1:k,nrow(logdfwr),replace=TRUE)
cv.errors=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))
for(j in 1:k){
best.fit = regsubsets ( quality ~ . , data=logdfwr [ folds != j , ],nvmax=12)
for(i in 1:11){
pred<-predict(best.fit,logdfwr[folds==j,],id=i)
cv.errors[j,i]=mean( (logdfwr$quality[folds==j]-pred)^2)
}
}
mean.cv.errors=apply(cv.errors ,2,mean)
mean.cv.errors
## 1 2 3 4 5 6
## 0.01199423 0.01058970 0.01026104 0.01017251 0.01017239 0.01017296
## 7 8 9 10 11
## 0.01011884 0.01008689 0.01007228 0.01010477 0.01008996
par(mfrow=c(1,1))
plot(mean.cv.errors ,type="b")
# white wine
dfTmp <- NULL
whichSum <- array(0,dim=c(11,12,4),
dimnames=list(NULL,colnames(model.matrix(quality ~ .,logdfww)),
c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test nTries (30) times:
nTries <- 30
for ( iTry in 1:nTries ) {
bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(logdfww)))
# Try each method available in regsubsets
# to select the best model of each size:
for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
rsTrain <- regsubsets(quality ~ .,logdfww[bTrain,],method=jSelect,nvmax=11)
# Add up variable selections:
whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
# Calculate test error for each set of variables
# using predict.regsubsets implemented above:
for ( kVarSet in 1:11 ) {
# make predictions:
testPred <- predict(rsTrain,logdfww[!bTrain,],id=kVarSet)
# calculate MSE:
mseTest <- mean((testPred-logdfww[!bTrain,"quality"])^2)
# add to data.frame for future plotting:
dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
}
}
}
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)
## k-fold cross-validation (10-fold)
# predict.regsubsets() defined in the red wine section above is used for the fold predictions
# First run best subset selection on the full data set and show the coefficients of the 11-variable model:
regfit.best=regsubsets(quality~.,data=logdfww,nvmax=12,really.big=TRUE)
coef(regfit.best ,11)
## (Intercept) fixed.acidity volatile.acidity
## 13.067606668 -0.003204615 -0.379658910
## citric.acid residual.sugar chlorides
## 0.014138854 0.045812118 -0.154265863
## free.sulfur.dioxide total.sulfur.dioxide density
## 0.043017426 -0.018854729 -18.323800592
## pH sulphates alcohol
## 0.184960019 0.113450742 0.472879995
#partitions
k=10
set.seed(1)
folds=sample(1:k,nrow(logdfww),replace=TRUE)
cv.errors=matrix(NA,k,11, dimnames=list(NULL, paste(1:11)))
for(j in 1:k){
best.fit = regsubsets ( quality ~ . , data=logdfww [ folds != j , ],nvmax=12)
for(i in 1:11){
pred<-predict(best.fit,logdfww[folds==j,],id=i)
cv.errors[j,i]=mean( (logdfww$quality[folds==j]-pred)^2)
}
}
mean.cv.errors=apply(cv.errors ,2,mean)
mean.cv.errors
## 1 2 3 4 5 6
## 0.01402359 0.01310509 0.01253814 0.01234569 0.01234074 0.01227148
## 7 8 9 10 11
## 0.01224258 0.01220519 0.01223475 0.01223085 0.01223612
par(mfrow=c(1,1))
plot(mean.cv.errors ,type="b")
The training and test errors follow almost the same pattern as the number of variables grows. A model size of about 5 looks sufficient, because the boxplots for the larger models are almost constant. The 10-fold cross-validation output above is consistent with this: the mean CV error (on the log-transformed quality scale) is lowest at 9 variables for red wine (0.01007) and at 8 variables for white wine (0.01221), but beyond 4 variables it changes by only about 0.0001.
Judging by the graphs, all four selection methods yield models of very comparable performance for both wines. The MSE boxplots differ between the two wines, probably because the red wine data has far fewer observations (1599) than the white wine data (4898).
For red wine the test error is higher than the training error, which could mean that the procedure is moving towards an optimal subset of variables. The gap should also be related to the smaller number of observations.
For red wine, density, pH, sulphates and alcohol seem to be the main predictors, with pH and alcohol more important than the other two. In problem 2 above the main predictors were alcohol, volatile.acidity and sulphates.
For white wine there is not much difference between the test and training errors.
As for red wine, the same variables come out as important for white wine overall, but density and sulphates are more important than the others.
In problem 2 the most important variables for white wine were alcohol, volatile.acidity and free.sulfur.dioxide, so the two approaches disagree on the most important variables in both cases.
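One common way to turn fold-by-fold errors into a single model size is the one-standard-error rule: take the smallest model whose mean CV error is within one standard error of the minimum. The sketch below illustrates the rule on simulated data, not on the wine data. The data set, the object names (dat, size.1se) and the nested lm fits that stand in for the regsubsets models are all assumptions made to keep the example self-contained.

```r
# Toy illustration of the one-standard-error rule (simulated data, base R only)
set.seed(1)
n <- 200; p <- 6
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1 + 0.8 * x[, 1] + 0.4 * x[, 2] + rnorm(n)   # only x1 and x2 matter
dat <- data.frame(y = y, x)

k <- 10
folds <- sample(rep(1:k, length.out = n))          # balanced fold labels
cv.errors <- matrix(NA, k, p, dimnames = list(NULL, paste(1:p)))
for (j in 1:k) {
  for (i in 1:p) {
    # nested models x1, x1+x2, ... stand in for "best model of size i"
    form <- reformulate(paste0("x", 1:i), response = "y")
    fit <- lm(form, data = dat[folds != j, ])
    pred <- predict(fit, dat[folds == j, ])
    cv.errors[j, i] <- mean((dat$y[folds == j] - pred)^2)
  }
}

mean.cv <- colMeans(cv.errors)                     # mean test MSE per size
se.cv <- apply(cv.errors, 2, sd) / sqrt(k)         # standard error across folds
best <- which.min(mean.cv)                         # size with the lowest mean error
# smallest size whose mean error is within one SE of the minimum
size.1se <- min(which(mean.cv <= mean.cv[best] + se.cv[best]))
print(c(best = unname(best), one.se = size.1se))
```

Applied to the cv.errors matrices computed above, the same three lines (colMeans, the standard error, and the final which) would give a size for each wine. Because the CV curves are so flat, the rule would probably favour fewer variables than the raw minimum.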
Use regularized approaches (i.e. lasso and ridge) to model quality of red and white wine (separately). Compare resulting models (in terms of number of variables and their effects) to those selected in the previous two tasks (by regsubsets and resampling), comment on differences and similarities among them.
xl <- model.matrix(quality~.,logdfwr)[,-1]
head(xl)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 2.128232 0.5306283 0.00000000 1.064711 0.07325046
## 2 2.174752 0.6312718 0.00000000 1.280934 0.09349034
## 3 2.174752 0.5653138 0.03922071 1.193922 0.08801088
## 4 2.501436 0.2468601 0.44468582 1.064711 0.07232066
## 5 2.128232 0.5306283 0.00000000 1.064711 0.07325046
## 6 2.128232 0.5068176 0.00000000 1.029619 0.07232066
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 2.484907 3.555348 0.6920466 1.506297 0.4446858
## 2 3.258097 4.219508 0.6915459 1.435085 0.5187938
## 3 2.772589 4.007333 0.6916461 1.449269 0.5007753
## 4 2.890372 4.110874 0.6921467 1.425515 0.4574248
## 5 2.484907 3.555348 0.6920466 1.506297 0.4446858
## 6 2.639057 3.713572 0.6920466 1.506297 0.4446858
## alcohol
## 1 2.341806
## 2 2.379546
## 3 2.379546
## 4 2.379546
## 5 2.341806
## 6 2.341806
yl <- logdfwr[,"quality"]
mylassoRes <- glmnet(scale(xl),yl,alpha=1)
plot(mylassoRes,label=TRUE)
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1)
plot(mycvLassoRes)
# refit with an explicit lambda grid: 10^-6 to 1, step 0.05 on the log10 scale
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/20))
plot(mycvLassoRes)
# wider grid reaching much smaller lambda: 10^-12 to 1
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/10))
plot(mycvLassoRes)
# coarse grid shifted towards large lambda: 10^-2 to 10
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-10:5)/5))
plot(mycvLassoRes)
predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.885053743
## fixed.acidity .
## volatile.acidity -0.023272403
## citric.acid .
## residual.sugar .
## chlorides .
## free.sulfur.dioxide .
## total.sulfur.dioxide .
## density .
## pH .
## sulphates 0.007467761
## alcohol 0.034614499
predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.88505374
## fixed.acidity .
## volatile.acidity -0.02693151
## citric.acid .
## residual.sugar .
## chlorides .
## free.sulfur.dioxide .
## total.sulfur.dioxide .
## density .
## pH .
## sulphates 0.01175615
## alcohol 0.03918927
mylassoResScaled <- glmnet(scale(xl),yl,alpha=1)
mycvLassoResScaled <- cv.glmnet(scale(xl),yl,alpha=1)
predict(mylassoResScaled,type="coefficients",s=mycvLassoResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.885053743
## fixed.acidity .
## volatile.acidity -0.024465228
## citric.acid .
## residual.sugar .
## chlorides .
## free.sulfur.dioxide .
## total.sulfur.dioxide .
## density .
## pH .
## sulphates 0.008865725
## alcohol 0.036105821
For red wine, the lasso coefficients show that 3 variables are retained as predictors (volatile.acidity, sulphates and alcohol), which matches exactly the three most important variables found for red wine in subproblem 2 above.
myridgeRes <- glmnet(scale(xl),yl,alpha=0)
plot(myridgeRes,label=TRUE)
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0)
plot(mycvRidgeRes)
mycvRidgeRes$lambda.min
## [1] 0.006177282
mycvRidgeRes$lambda.1se
## [1] 0.07615642
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.885053743
## fixed.acidity 0.010354822
## volatile.acidity -0.029309781
## citric.acid -0.006009319
## residual.sugar 0.003119953
## chlorides -0.012170082
## free.sulfur.dioxide 0.008493006
## total.sulfur.dioxide -0.012602212
## density -0.007926178
## pH -0.006464392
## sulphates 0.024247794
## alcohol 0.038845210
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.885053743
## fixed.acidity 0.005646014
## volatile.acidity -0.019888710
## citric.acid 0.004474523
## residual.sugar 0.002102760
## chlorides -0.008842245
## free.sulfur.dioxide 0.002426017
## total.sulfur.dioxide -0.007636518
## density -0.008579098
## pH -0.002586310
## sulphates 0.016580794
## alcohol 0.026320357
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/20))
plot(mycvRidgeRes)
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/5))
plot(mycvRidgeRes)
myridgeResScaled <- glmnet(scale(xl),yl,alpha=0)
mycvRidgeResScaled <- cv.glmnet(scale(xl),yl,alpha=0)
predict(myridgeResScaled,type="coefficients",s=mycvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.885053743
## fixed.acidity 0.005646014
## volatile.acidity -0.019888710
## citric.acid 0.004474523
## residual.sugar 0.002102760
## chlorides -0.008842245
## free.sulfur.dioxide 0.002426017
## total.sulfur.dioxide -0.007636518
## density -0.008579098
## pH -0.002586310
## sulphates 0.016580794
## alcohol 0.026320357
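For red wine, ridge regression gives broadly the same picture as lasso: the largest coefficients belong to alcohol, volatile.acidity and sulphates. Ridge does not set any coefficient to zero, however, so the best fit in this case keeps all 11 attributes.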
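The reason lasso drops variables while ridge keeps all of them can be seen in the special case of orthonormal predictors, where both estimators have closed forms: lasso soft-thresholds the least-squares coefficients and ridge rescales them. The sketch below uses made-up coefficients (b1 to b5) and an arbitrary lambda, and the function names are ours. glmnet scales its penalty differently, so these numbers are not comparable to the output above.

```r
# Closed-form shrinkage for orthonormal predictors (toy coefficients, base R only)
# lasso: minimize 0.5*||y - Xb||^2 + lambda*||b||_1    -> soft thresholding
# ridge: minimize 0.5*||y - Xb||^2 + 0.5*lambda*||b||^2 -> proportional shrinkage
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
ridge_shrink <- function(b, lambda) b / (1 + lambda)

beta.ols <- c(b1 = 2.0, b2 = -1.2, b3 = 0.5, b4 = -0.2, b5 = 0.1)  # hypothetical OLS fit
lambda <- 0.3

beta.lasso <- soft_threshold(beta.ols, lambda)  # small coefficients become exactly 0
beta.ridge <- ridge_shrink(beta.ols, lambda)    # every coefficient shrinks, none is 0

print(rbind(ols = beta.ols, lasso = beta.lasso, ridge = beta.ridge))
```

Coefficients smaller in magnitude than lambda are removed by lasso, which is why the lasso fits above retain only a few variables while the ridge fits report a non-zero value for every attribute.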
xl <- model.matrix(quality~.,logdfww)[,-1]
head(xl)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 2.079442 0.2390169 0.3074847 3.0773123 0.04401689
## 2 1.987874 0.2623643 0.2926696 0.9555114 0.04783733
## 3 2.208274 0.2468601 0.3364722 2.0668628 0.04879016
## 4 2.104134 0.2070142 0.2776317 2.2512918 0.05638033
## 5 2.104134 0.2070142 0.2776317 2.2512918 0.05638033
## 6 2.208274 0.2468601 0.3364722 2.0668628 0.04879016
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 3.828641 5.141664 0.6936471 1.386294 0.3715636
## 2 2.708050 4.890349 0.6901427 1.458615 0.3987761
## 3 3.433987 4.584967 0.6906942 1.449269 0.3646431
## 4 3.871201 5.231109 0.6909448 1.432701 0.3364722
## 5 3.871201 5.231109 0.6909448 1.432701 0.3364722
## 6 3.433987 4.584967 0.6906942 1.449269 0.3646431
## alcohol
## 1 2.282382
## 2 2.351375
## 3 2.406945
## 4 2.388763
## 5 2.388763
## 6 2.406945
yl <- logdfww[,"quality"]
mylassoRes <- glmnet(scale(xl),yl,alpha=1)
plot(mylassoRes,label=TRUE)
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1)
plot(mycvLassoRes)
# refit with an explicit lambda grid: 10^-6 to 1, step 0.05 on the log10 scale
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/20))
plot(mycvLassoRes)
# wider grid reaching much smaller lambda: 10^-12 to 1
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-120:0)/10))
plot(mycvLassoRes)
# coarse grid shifted towards large lambda: 10^-2 to 10
mycvLassoRes <- cv.glmnet(scale(xl),yl,alpha=1,lambda=10^((-10:5)/5))
plot(mycvLassoRes)
predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.919907e+00
## fixed.acidity .
## volatile.acidity -1.862303e-02
## citric.acid .
## residual.sugar 5.734653e-05
## chlorides .
## free.sulfur.dioxide 1.287672e-02
## total.sulfur.dioxide .
## density .
## pH .
## sulphates .
## alcohol 4.888323e-02
predict(mylassoRes,type="coefficients",s=mycvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.919907e+00
## fixed.acidity .
## volatile.acidity -1.862303e-02
## citric.acid .
## residual.sugar 5.734653e-05
## chlorides .
## free.sulfur.dioxide 1.287672e-02
## total.sulfur.dioxide .
## density .
## pH .
## sulphates .
## alcohol 4.888323e-02
mylassoResScaled <- glmnet(scale(xl),yl,alpha=1)
mycvLassoResScaled <- cv.glmnet(scale(xl),yl,alpha=1)
predict(mylassoResScaled,type="coefficients",s=mycvLassoResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.9199069221
## fixed.acidity -0.0011462225
## volatile.acidity -0.0219211915
## citric.acid .
## residual.sugar 0.0040690778
## chlorides -0.0006769184
## free.sulfur.dioxide 0.0152813087
## total.sulfur.dioxide .
## density .
## pH .
## sulphates .
## alcohol 0.0538604059
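For white wine, lasso retains volatile.acidity, residual.sugar, free.sulfur.dioxide and alcohol as predictors, which is slightly different from subproblem 2 above. The second cross-validation run also admits fixed.acidity and chlorides with very small coefficients; the two runs use identical calls, and the difference comes from the random fold assignment in cv.glmnet.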
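The overlap between the lasso selection and the 6-variable subset from regsubsets can be checked directly with set operations. The variable names below are copied from the white wine output above (row 6 of the which table and the non-zero lasso coefficients at lambda.1se in the first run); the object names are ours.

```r
# Compare variable sets chosen for white wine by two methods (names taken from the output above)
subset6.white <- c("volatile.acidity", "residual.sugar", "free.sulfur.dioxide",
                   "density", "sulphates", "alcohol")       # regsubsets, 6-variable model
lasso.white <- c("volatile.acidity", "residual.sugar",
                 "free.sulfur.dioxide", "alcohol")          # lasso, non-zero at lambda.1se

common <- intersect(subset6.white, lasso.white)       # chosen by both
only.subset <- setdiff(subset6.white, lasso.white)    # chosen by regsubsets only
only.lasso <- setdiff(lasso.white, subset6.white)     # chosen by lasso only

print(common)
print(only.subset)
print(only.lasso)
```

Every variable retained by lasso is also in the 6-variable subset; the subset additionally contains density and sulphates.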
myridgeRes <- glmnet(scale(xl),yl,alpha=0)
plot(myridgeRes,label=TRUE)
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0)
plot(mycvRidgeRes)
mycvRidgeRes$lambda.min
## [1] 0.006014445
mycvRidgeRes$lambda.1se
## [1] 0.04243072
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.919906922
## fixed.acidity -0.002085572
## volatile.acidity -0.026917224
## citric.acid 0.001508390
## residual.sugar 0.026852434
## chlorides -0.004478281
## free.sulfur.dioxide 0.021660684
## total.sulfur.dioxide -0.005914855
## density -0.021628782
## pH 0.005266119
## sulphates 0.007647940
## alcohol 0.048681719
predict(myridgeRes,type="coefficients",s=mycvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.919906922
## fixed.acidity -0.003956881
## volatile.acidity -0.020311641
## citric.acid 0.002204200
## residual.sugar 0.013886076
## chlorides -0.007769864
## free.sulfur.dioxide 0.016841601
## total.sulfur.dioxide -0.004259868
## density -0.015480875
## pH 0.003731517
## sulphates 0.005528480
## alcohol 0.036359717
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/20))
plot(mycvRidgeRes)
mycvRidgeRes <- cv.glmnet(scale(xl),yl,alpha=0,lambda=10^((-80:80)/5))
plot(mycvRidgeRes)
myridgeResScaled <- glmnet(scale(xl),yl,alpha=0)
mycvRidgeResScaled <- cv.glmnet(scale(xl),yl,alpha=0)
predict(myridgeResScaled,type="coefficients",s=mycvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.919906922
## fixed.acidity -0.003956881
## volatile.acidity -0.020311641
## citric.acid 0.002204200
## residual.sugar 0.013886076
## chlorides -0.007769864
## free.sulfur.dioxide 0.016841601
## total.sulfur.dioxide -0.004259868
## density -0.015480875
## pH 0.003731517
## sulphates 0.005528480
## alcohol 0.036359717
For white wine, ridge regression gives the largest standardized coefficients to alcohol, volatile.acidity, free.sulfur.dioxide, density and residual.sugar. Ridge shrinks coefficients but never sets them exactly to zero, so all 11 variables stay in the model at the optimal lambda shown in the diagrams above. This differs from lasso regression and also does not agree with the findings of subproblems 2 and 3 above.
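Because the predictors were standardized with scale() before fitting, the ridge coefficients can be compared by absolute size. The sketch below ranks the white wine coefficients at lambda.1se; the values are copied from the output above, and only the object names are ours.

```r
# Rank the white wine ridge coefficients (lambda.1se) by absolute size
ridge.white <- c(fixed.acidity = -0.003956881, volatile.acidity = -0.020311641,
                 citric.acid = 0.002204200, residual.sugar = 0.013886076,
                 chlorides = -0.007769864, free.sulfur.dioxide = 0.016841601,
                 total.sulfur.dioxide = -0.004259868, density = -0.015480875,
                 pH = 0.003731517, sulphates = 0.005528480,
                 alcohol = 0.036359717)

ranked <- sort(abs(ridge.white), decreasing = TRUE)  # largest effect first
print(round(ranked, 4))
top5 <- names(ranked)[1:5]
print(top5)
```

This kind of ranking is only meaningful for standardized predictors; on the raw scale the coefficient size would mostly reflect the units of each variable.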
Merge data for red and white wine (function rbind allows merging of two matrices/data frames with the same number of columns) and plot data projection to the first two principal components (e.g. biplot or similar plots). Does this representation suggest presence of clustering structure in the data? Does wine type (i.e. red or white) or quality appear to be associated with different regions occupied by observations in the plot? Please remember not to include quality attribute or wine type (red or white) indicator in your merged data, otherwise, apparent association of quality or wine type with PCA layout will be influenced by presence of those indicators in your data.
#Merge the 2 wines and perform initial study of data
comwine<-rbind(logdfwr[,-12],logdfww[,-12])
dim(comwine)
## [1] 6497 11
head(comwine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 2.128232 0.5306283 0.00000000 1.064711 0.07325046
## 2 2.174752 0.6312718 0.00000000 1.280934 0.09349034
## 3 2.174752 0.5653138 0.03922071 1.193922 0.08801088
## 4 2.501436 0.2468601 0.44468582 1.064711 0.07232066
## 5 2.128232 0.5306283 0.00000000 1.064711 0.07325046
## 6 2.128232 0.5068176 0.00000000 1.029619 0.07232066
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1 2.484907 3.555348 0.6920466 1.506297 0.4446858
## 2 3.258097 4.219508 0.6915459 1.435085 0.5187938
## 3 2.772589 4.007333 0.6916461 1.449269 0.5007753
## 4 2.890372 4.110874 0.6921467 1.425515 0.4574248
## 5 2.484907 3.555348 0.6920466 1.506297 0.4446858
## 6 2.639057 3.713572 0.6920466 1.506297 0.4446858
## alcohol
## 1 2.341806
## 2 2.379546
## 3 2.379546
## 4 2.379546
## 5 2.341806
## 6 2.341806
colnames(comwine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol"
pca.out<-prcomp(comwine,scale=TRUE)
plot(pca.out)
biplot(pca.out,scale=TRUE)
PCA analysis
Looking at the biplot, we do not see any clear clustering: the data points are concentrated in the center.
PC1 gives more weight to citric acid, SO2 and alcohol.
PC2 gives more weight to density, chlorides and volatile acidity. Neither component gives much weight to pH.
Quality was excluded from the PCA input, so it has no loading of its own. Judging from the plot, wine quality appears to be associated more with PC1.
Going by the row numbers displayed in the biplot (red wine rows come first in the merged data), the wine types are mostly spread out. Looking closely, density, chlorides, sulphates, pH and residual sugar appear to characterize white wine, whereas density and chlorides characterize red wine, which is slightly different from the analysis above.
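With several thousand row labels, a biplot is hard to read, so group structure is easier to judge from a scatter plot of the scores colored by a label that was kept out of the PCA. The sketch below shows the idea on simulated data with two groups; it is not the wine data, and the group sizes, the shift and all object names are assumptions. For the merged wine data the label would be a vector marking the red rows followed by the white rows.

```r
# Toy example: color PCA scores by a group label that was not used in the PCA
set.seed(1)
n.red <- 100; n.white <- 300
red <- matrix(rnorm(n.red * 4), n.red, 4)
white <- matrix(rnorm(n.white * 4), n.white, 4)
white[, 1:2] <- white[, 1:2] + 3              # groups differ in two variables
toy <- rbind(red, white)
colnames(toy) <- paste0("v", 1:4)
type <- rep(c("red", "white"), c(n.red, n.white))  # label kept outside the PCA input

pca.toy <- prcomp(toy, scale. = TRUE)

# scores on the first two components, colored by group
plot(pca.toy$x[, 1:2], col = ifelse(type == "red", "red", "gray40"),
     pch = 20, xlab = "PC1", ylab = "PC2")
legend("topright", legend = c("red", "white"), col = c("red", "gray40"), pch = 20)

# numeric check of the separation: mean PC1 score per group
pc1.means <- tapply(pca.toy$x[, 1], type, mean)
print(pc1.means)
```

The same plot can be colored by quality instead of wine type to see whether quality follows either component.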
Compute PCA representation of the data for one of the wine types (red or white) excluding wine quality attribute (of course!). Use resulting principal components (slot x in the output of prcomp) as new predictors to fit a linear model of wine quality as a function of these predictors. Compare resulting fit (in terms of MSE, r-squared, etc.) to those obtained above. Comment on the differences and similarities between these fits.
# modelling wine quality using principal components for red wine
pca.out<-prcomp(logdfwr[,-12],scale=TRUE)
summary(pca.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7717 1.4190 1.2565 1.0879 1.00548 0.81051 0.74878
## Proportion of Variance 0.2854 0.1831 0.1435 0.1076 0.09191 0.05972 0.05097
## Cumulative Proportion 0.2854 0.4684 0.6119 0.7195 0.81145 0.87117 0.92214
## PC8 PC9 PC10 PC11
## Standard deviation 0.62261 0.50675 0.39714 0.23301
## Proportion of Variance 0.03524 0.02335 0.01434 0.00494
## Cumulative Proportion 0.95738 0.98073 0.99506 1.00000
plot(pca.out)
biplot(pca.out,scale=TRUE)
pca.out$x[1:10,]
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -1.6605340 0.6960741 1.6430079 0.13548841 0.13615726 0.98752020
## 2 -0.7849915 2.0585544 0.7822421 0.41676005 -0.26291028 -0.79618333
## 3 -0.7358896 1.2231335 0.9462328 0.38158913 -0.04795376 -0.30556219
## 4 2.2587260 0.1480034 -0.6244777 -0.54372152 1.87183292 0.08243416
## 5 -1.6605340 0.6960741 1.6430079 0.13548841 0.13615726 0.98752020
## 6 -1.6545393 0.8790980 1.3629100 0.15620118 0.34250052 1.03589340
## 7 -1.2123682 0.9036645 0.9377109 -0.02711461 1.53163373 -0.20708960
## 8 -2.4474399 -0.4207818 0.9579970 0.47354052 1.31214704 -0.29664324
## 9 -1.0890793 -0.3629164 1.5691287 0.17030774 0.35978883 0.54137213
## 10 0.7044294 1.5579468 -1.1274291 -1.45818109 -1.87639791 0.46087576
## PC7 PC8 PC9 PC10 PC11
## 1 -0.12735849 0.32103329 -0.25263921 -0.26807751 0.04148620
## 2 -1.18330532 -0.81089290 -0.28244916 0.02653793 -0.04237161
## 3 -0.73740570 -0.52008093 -0.07436388 -0.26507990 -0.04758413
## 4 0.34417641 0.46538243 -0.12423025 -0.24027108 0.23032457
## 5 -0.12735849 0.32103329 -0.25263921 -0.26807751 0.04148620
## 6 -0.09359533 0.36077583 -0.34316817 -0.34703528 0.01491793
## 7 0.02786749 -0.08789045 -0.18051734 -0.47862964 0.08808856
## 8 -0.23927040 0.06877128 -0.70414027 0.29283533 0.17945317
## 9 0.08029610 -0.46636010 -0.60756237 0.03419638 0.11330259
## 10 0.34061746 -0.79112784 0.94011673 -0.13286267 0.13648525
mww<-lm(logdfwr$quality ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+PC11,as.data.frame.matrix(pca.out$x))
summary(mww)
##
## Call:
## lm(formula = logdfwr$quality ~ PC1 + PC2 + PC3 + PC4 + PC5 +
## PC6 + PC7 + PC8 + PC9 + PC10 + PC11, data = as.data.frame.matrix(pca.out$x))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51177 -0.05083 -0.00499 0.06926 0.27889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.885054 0.002489 757.209 < 2e-16 ***
## PC1 0.007937 0.001406 5.647 1.93e-08 ***
## PC2 -0.030493 0.001755 -17.376 < 2e-16 ***
## PC3 -0.040046 0.001982 -20.206 < 2e-16 ***
## PC4 -0.006537 0.002289 -2.856 0.00435 **
## PC5 -0.010747 0.002477 -4.339 1.52e-05 ***
## PC6 0.003919 0.003072 1.275 0.20234
## PC7 -0.017072 0.003326 -5.133 3.20e-07 ***
## PC8 -0.012640 0.004000 -3.160 0.00161 **
## PC9 -0.028569 0.004914 -5.814 7.38e-09 ***
## PC10 -0.007890 0.006270 -1.258 0.20846
## PC11 -0.005347 0.010687 -0.500 0.61696
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09955 on 1587 degrees of freedom
## Multiple R-squared: 0.3468, Adjusted R-squared: 0.3423
## F-statistic: 76.61 on 11 and 1587 DF, p-value: < 2.2e-16
Comparing the principal component model for red wine with the model fit to the log-transformed red wine predictors, the residual standard error and R-squared have the same values. This is expected: using all 11 principal components only rotates the scaled predictors, so the fitted values do not change. The coefficients do change, because each component is a linear combination of the original variables rather than a single attribute. From the summary above, PC1 explains 29% of the variance, PC2 explains 18%, and so on.
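The equivalence between a linear model on the original predictors and a linear model on all principal components can be checked on any data set. The sketch below uses the built-in mtcars data as a stand-in for the wine data; the choice of columns and the object names are assumptions for illustration only. It also fits a model on the first two components to show that R-squared drops once components are left out.

```r
# Regression on all principal components reproduces the original fit (mtcars as stand-in data)
X <- mtcars[, c("disp", "hp", "wt", "qsec")]
dat.orig <- data.frame(y = mtcars$mpg, X)

pca <- prcomp(X, scale. = TRUE)
dat.pc <- data.frame(y = mtcars$mpg, pca$x)   # columns PC1..PC4

fit.orig <- lm(y ~ ., data = dat.orig)        # original predictors
fit.pc.all <- lm(y ~ ., data = dat.pc)        # all principal components
fit.pc.2 <- lm(y ~ PC1 + PC2, data = dat.pc)  # first two components only

r2 <- c(original = summary(fit.orig)$r.squared,
        all.pcs = summary(fit.pc.all)$r.squared,
        first.two = summary(fit.pc.2)$r.squared)
print(round(r2, 4))

# fitted values agree up to numerical precision
print(all.equal(unname(fitted(fit.orig)), unname(fitted(fit.pc.all))))
```

Any advantage of principal component regression therefore has to come from keeping fewer components than there are predictors, with the number of components chosen by resampling as in the earlier subproblems.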